Machine Learning Models
Decision Trees
Deep learning works really well for homogeneous data, like images, while gradient boosting shines with small amounts of data and diverse data sources. The two play well together, as we can feed the outputs of deep learning models into the boosting models as additional features.
The mainstay of boosted trees is XGBoost, which is probably going to be the best-performing model.
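As a rough sketch of that combination (assuming PyTorch and the xgboost package are available, and using made-up toy data), one can train a small network, take its hidden-layer activations as learned features, and hand them to a booster:

```python
import numpy as np
import torch
import torch.nn as nn
import xgboost as xgb

# Toy data: pretend these columns are "raw" homogeneous inputs
# (e.g., flattened image pixels) with a continuous target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)).astype("float32")
y = (2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)).astype("float32")

# A small network whose hidden layer we will reuse as features.
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

Xt, yt = torch.from_numpy(X), torch.from_numpy(y).unsqueeze(1)
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(net(Xt), yt)
    loss.backward()
    opt.step()

# Take the hidden-layer activations as learned features...
with torch.no_grad():
    deep_features = net[:2](Xt).numpy()

# ...and feed them, alongside the original columns, to a booster.
booster = xgb.XGBRegressor(n_estimators=200, max_depth=4)
booster.fit(np.hstack([X, deep_features]), y)
```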
XGBoost
CatBoost
Pool
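When categorical columns are involved, CatBoost consumes data through its Pool class. Here is a minimal sketch; the toy columns and values are made up for illustration:

```python
from catboost import CatBoostRegressor, Pool

# Hypothetical toy dataset: two categorical columns and one numeric column.
train_data = [["red", "small", 1.0],
              ["blue", "large", 2.0],
              ["red", "large", 3.0],
              ["green", "small", 4.0]]
train_labels = [10.0, 20.0, 30.0, 40.0]

# Pool bundles the data with its metadata; cat_features marks categorical
# columns by index so CatBoost applies its categorical encodings to them.
train_pool = Pool(train_data, label=train_labels, cat_features=[0, 1])

model = CatBoostRegressor(iterations=50, verbose=False)
model.fit(train_pool)
```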
Categorical feature encoding parameters in CatBoost
The number of parameters related to categorical feature processing in CatBoost is overwhelming. Here is, hopefully, the full list:
one_hot_max_size (int) - Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. No complex encoding is performed for such features. The default for the regression task is 2.

model_size_reg (float from 0 to inf) - The model size regularization coefficient. The larger the value, the smaller the model size. Refer to the Model size regularization coefficient section for details. This regularization is needed only for models with categorical features (other models are small). Models with categorical features might weigh tens of gigabytes or more if the categorical features have a lot of values. If the value of the regularizer differs from zero, the usage of categorical features or feature combinations with a lot of values has a penalty, so fewer of them are used in the resulting model. The default value is 0.5.

max_ctr_complexity - The maximum number of features that can be combined. Each resulting combination consists of one or more categorical features and can optionally contain binary features in the following form: "numeric feature > value". For the regression task on CPU the default value is 4.

has_time (bool) - If True, the first step of categorical features processing, the permutation, is not performed. Useful when the objects in your dataset are ordered by time. For our dataset, we don't need it. The default value is False.

simple_ctr - Quantization settings for simple categorical features.

combinations_ctr - Quantization settings for combinations of categorical features.

per_feature_ctr - Per-feature quantization settings for categorical features.

counter_calc_method - Determines whether to use the validation dataset (provided through the eval_set parameter of the fit method) to estimate category frequencies with Counter. By default it is Full, and the objects from the validation dataset are used; pass the SkipTest value to ignore the objects from the validation set.

ctr_target_border_count - The maximum number of borders to use in target quantization for categorical features that need it. The default for the regression task is 1.

ctr_leaf_count_limit - The maximum number of leaves with categorical features. The default value is None, i.e. no limit.

store_all_simple_ctr - If the previous parameter ctr_leaf_count_limit is set, at some point the gradient boosting tree can no longer make splits by categorical features. With the default value False, the limitation applies both to the original categorical features and the features that CatBoost creates by combining different features. If this parameter is set to True, only the number of splits made on combination features is limited.
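To make the list concrete, here is a minimal sketch of how these parameters are passed to the model constructor; the values shown just restate the defaults and options mentioned above, not tuned recommendations:

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    one_hot_max_size=2,              # one-hot encode features with <= 2 categories
    model_size_reg=0.5,              # penalize heavy categorical encodings
    max_ctr_complexity=4,            # combine up to 4 features into one CTR
    has_time=False,                  # dataset is not time-ordered, so keep the permutation
    counter_calc_method="SkipTest",  # ignore eval_set objects when computing Counter CTRs
    verbose=False,
)
```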
The three parameters simple_ctr, combinations_ctr, and per_feature_ctr are complex parameters that control the second and the third steps of categorical features processing. We will talk about them more in the next sections.
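As a preview, all three accept lists of strings naming a CTR type followed by optional colon-separated settings. The particular values below are an assumed illustration of the syntax only; what they mean is discussed in the next sections:

```python
from catboost import CatBoostRegressor

# Illustrative sketch of the CtrType[:setting=value]... string syntax;
# the specific choices here are assumptions, not recommendations.
model = CatBoostRegressor(
    simple_ctr=["Borders:TargetBorderCount=1", "Counter"],
    combinations_ctr=["Borders", "Counter"],
    per_feature_ctr=["0:Borders:TargetBorderCount=1"],  # settings for feature 0 only
    verbose=False,
)
```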